Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for more file types + archlinux packaging #9

Open
wants to merge 20 commits into
base: master
Choose a base branch
from

Conversation

neowutran
Copy link

@neowutran neowutran commented Apr 22, 2020

Main "features" of this pull request

  • Add more file type: Support PDF protected by password, and support office files (all files understood by libreoffice, with password protected file support)
  • Add archlinux packaging

Minor improvements

  • merge the man page with the readme to avoid text duplication (that does not get updated at the same frequency..)
  • Switch from ImageMagick to GraphicsMagick (it is a fork of ImageMagick, it is theoretically faster, with less features and less bug prone)

Things that need improvement / TODO:

  • How to quickly tests the software while developing it ? I created a new directory called "dev_tools", it contain some safe tests file and a script to automatically convert them.

  • Fix debian and fedora packaging

  • In the client script I added a code that read the pdf file to get its size
    wc -c "$INPUT_FILE" | cut -d ' ' -f 1
    I assumed that counting the number of bytes of a file is a operation simple enough to be bug free.
    This line is part of the changes I did to support password protected file from the GUI (The server script tell the client script "Hey! It need password!" Or "Hey! All fine, it doesn't need password!" depending on that, the GUI prompt a password form. So I needed avoid closing the file descriptor exec >&- to be able to keep communicating between client and server after the file have been transferred.
    Didn't found a way to avoid this bytes counting on the file.

Why this pull request

I found this project https://dangerzone.rocks https://github.com/firstlookmedia/dangerzone https://github.com/firstlookmedia/dangerzone-converter from Micahflee who implemented the idea for non-Qube based system and improved it to support more file type. So trying to implement it back for Qubes, and improving it a bit further to support password protected file and allow some more file type (still only the files supported by libreoffice, but I am way more permissive with mime type check)

Tests I did

The things in "dev_tools"
Tested it on archlinux, with gui and with cli.
Didn't tested it on debian/fedora yet

I see there is a pull request to re-implement the script from bash to python.
So this pull request will require modifications once the bash -> python is done

@marmarek
Copy link
Member

* How to quickly tests the software while developing it ? I created a new directory called "dev_tools", it contain some safe tests file and a script to automatically convert them.

qrexec really provides pipe-like connection between two processes, running in different VMs. You can emulate it locally with socat EXEC:client EXEC:server.

* This line is part of the changes I did to support password protected file from the GUI (The server script tell the client script "Hey! It need password!" Or "Hey! All fine, it doesn't need password!" depending on that, the GUI prompt a password form.  So I needed avoid closing the file descriptor `exec >&-` to be able to keep communicating between client and server after the file have been transferred.

I don't like this added interactive part to the protocol. Why not simply display a password prompt from within DispVM (server part)?

@neowutran
Copy link
Author

neowutran commented Apr 26, 2020

Ran into another issue:
Cannot convert password protected files on Debian. The "unoconv" package available on debian is way too old (~2017).
Even the version in SID is way too old.

So multiples options:

Just download the latest unoconv version from github https://github.com/unoconv/unoconv/blob/master/unoconv and running it on debian buster: work as expected

(Also, played a bit with libreoffice (in fact found nearly all that code on the internet but didn't saved the original url) macro to remove password)

REM  *****  BASIC  *****

Sub Main
   dim properties(0) as new com.sun.star.beans.PropertyValue
   url = convertToURL("/home/user/qubes-app-linux-pdf-converter/tools/files/password_protected.odt")
   properties(0).Name = "Password"
   properties(0).Value = "toor"
   properties(1).Name = "Hidden"
   properties(1).Value = True
   doc = StarDesktop.loadComponentFromUrl(url, "_blank", 0, properties())
   dim properties2(0) as new com.sun.star.beans.PropertyValue
   doc.storeAsURL("file:////home/user/qubes-app-linux-pdf-converter/tools/files/password_protected.not.odt", properties2())
End Sub

@ibokuri
Copy link
Contributor

ibokuri commented Apr 26, 2020

I see there is a pull request to re-implement the script from bash to python.

That's me! :D

I didn't know about Dangerzone when I started but you've certainly put me on to them now! Looks like the major differences are OCR, compressed PDF size, and multiple file types.

Right now, I'm just trying to stabilize my PR before adding any more new features. But, if you'd like, I certainly wouldn't mind some extra help on implementing those things afterwards. It seems you already have the multiple file types down, and porting them to Python wouldn't be too hard (he said hopefully).

@neowutran
Copy link
Author

Hi,
Don't worry, still have some issues to fix (aka: mostly Debian issues).
For the "compressed PDF size", from the few tests I did, the result is either bigger or of the same size (used the same command line as Dangerzone, but maybe more tests should be interesting)
(And for the OCR part, I have a mixed personal opinion: https://github.com/QubesOS/qubes-app-linux-pdf-converter/pull/9/files#diff-f8d345018d9d86cc6bd2b8c68545b447R48 )

@ibokuri
Copy link
Contributor

ibokuri commented May 1, 2020

And for the OCR part, I have a mixed personal opinion [...]

From the links you pointed out, it looks like OCR (well, Tesseract anyways) doesn't work on raw RGB bitmaps but more finalized image formats like PNG. Idk where Dangerzone does its OCR but we create PNGs client-side, meaning OCR would occur on the client as well, which is assumed to be safe.

The main things we need to worry about are what the server sends over: page count, image dimensions, and RGB bitmaps. Valid but incorrect submissions of these would be a pain to deal with.

@neowutran
Copy link
Author

neowutran commented May 2, 2020

Some updates about debian:
Created a new issue: https://bugs.launchpad.net/ubuntu/+source/unoconv/+bug/1876448
Didn't sent a pull request here for the moment https://salsa.debian.org/debian/unoconv ( I have trouble to understand debian packaging and why it need so many differents files )

Tried the following workaround (seems working but didn't tested it on all templates yet):

	libreoffice --accept='socket,host=localhost,port=2202;urp;' --norestore --nologo --nodefault >/dev/null 2>/dev/null &
	listener_notready=1
	
	# Wait until libreoffice server is started
	while [ $listener_notready -ne 0 ];
	do
		sleep 1
		netstat -anop 2> /dev/null | grep '127.0.0.1:2202' | grep LISTEN >/dev/null 2>/dev/null
		listener_notready=$?
	done

	# Remove password from file using libreoffice API
	# ...
		python3 -c '
import os
import uno
from com.sun.star.beans import PropertyValue
import sys

src="file://'"$INPUT_FILE"'"
dst="file://'"$INPUT_FILE.nopassword"'"
password="'"$PASSWORD"'"

localContext = uno.getComponentContext()
resolver = localContext.ServiceManager.createInstanceWithContext("com.sun.star.bridge.UnoUrlResolver", localContext)
ctx = resolver.resolve("uno:socket,host=localhost,port=2202;urp;StarOffice.ComponentContext")
smgr = ctx.ServiceManager
desktop = smgr.createInstanceWithContext("com.sun.star.frame.Desktop", ctx)

hidden_property = PropertyValue()
hidden_property.Name = "Hidden"
hidden_property.Value = True

password_property = PropertyValue()
password_property.Name = "Password"
password_property.Value = password

document = desktop.loadComponentFromURL(src, "_blank", 0, (password_property, hidden_property,))
document.storeAsURL(dst, ())' >&2
  # ...
	libreoffice --convert-to pdf "$INPUT_FILE.nopassword" --outdir /tmp/ >&2
	mv "$INPUT_FILE"".pdf" "$INPUT_FILE"

It remove the unoconv dependency but it add around 40 lines of code.

Update

The workaround seems to be working on all templates I tested (archlinux, debian buster, and fedora 32)

TODO:

  • Try to do a pull request for unoconv debian package
  • Wait a bit to see if something is moving for debian unoconv package. But very unlikely that anything land on stable quickly.
  • Verify that I fixed the old-stable debian packaging that I broke earlier
  • In archlinux package I added tests (try to convert some files before installing the package). Good idea / Bad idea ? If good idea, check how to add the archlinux "check" to debian and fedora packaging It also exist for debian, it is recommanded by archlinux and recommanded by debian. Didn't checked fedora yet.
  • Properly integrate thoses tests to https://github.com/QubesOS/qubes-app-linux-pdf-converter/blob/master/qubespdfconverter/tests.py (Better to way until the rewrite to python is over)

@neowutran
Copy link
Author

Pretty much all I wanted to add is done. Putting this pull request on hold until the port from bash to python is done and accepted. And then need to integrate those changes to the python port, but should be too hard

neowutran added a commit to neowutran/qubes-app-linux-pdf-converter that referenced this pull request Jul 4, 2020
@neowutran
Copy link
Author

Update: since the bash -> python rewrite is over, I started to rewrite this pull request in python.
Currently here https://github.com/neowutran/qubes-app-linux-pdf-converter/tree/QubesOS-master.
Once I reach a point where it is working fine, i will merge it back to master.
(PS: I am learning python at the same time)

neowutran added a commit to neowutran/qubes-app-linux-pdf-converter that referenced this pull request Jul 11, 2020
@neowutran
Copy link
Author

@Bl0nd could you do a first review of the python code ? :)
Should be working, generated the packages for fc32, buster & archlinux.
Tested on archlinux

But probably lot of rooms for improvements
Thanks ! :)

Copy link
Contributor

@ibokuri ibokuri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have Qubes installed right now so this is just from me looking over the code a bit.

In addition to my comments, here are some general thoughts:

  • As it stands, I'd rather not support password-protected PDFs. The main issues with this implementation are that passwords are specified as command-line arguments, you may only specify 1 password even though multiple files may have different passwords, and _decrypt() is ironically incredibly cryptic relative to the rest of the codebase.

  • It seems to me that --gui is just for password prompts? If so, I think it's a bit unnecessary. If password-protected PDFs were supported, GUI prompts are basically the only option considering that multiple files can be processed. Plus, GUI prompts shouldn't be created from the server (for security reasons), but from requests made to Dom0 which always provides a GUI.

  • Having to use raw sockets for multiple file type support... Is this really the only the way? 😅

Note that I didn't review much of server.py. That's mainly due to the thoughts I listed above.

qubespdfconverter/client.py Outdated Show resolved Hide resolved
qubespdfconverter/client.py Outdated Show resolved Hide resolved
qubespdfconverter/client.py Outdated Show resolved Hide resolved
qubespdfconverter/client.py Outdated Show resolved Hide resolved
qubespdfconverter/client.py Outdated Show resolved Hide resolved
Comment on lines 23 to 27
###########################
# The project "Dangerzone" reused the idea of this script based on:
# https://blog.invisiblethings.org/2013/02/21/converting-untrusted-pdfs-into-trusted.html
#
# - https://github.com/firstlookmedia/dangerzone-converter
# - https://dangerzone.rocks/
# - https://github.com/firstlookmedia/dangerzone
#
# Dangerzone try to export the idea to non Qubes based system, and try to improve it.
# Both projects can improve the other.
###########################
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need this.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, or at least not in this way.
My intend was to say to a potential contributor "Hey, a very similar project exist, and maybe we can take back some good ideas and re-implement them here".
IDK if:

  • Should not do that at all
  • Should do that, but way less verbose and/or somewhere else

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

qubespdfconverter/server.py Outdated Show resolved Hide resolved
qubespdfconverter/server.py Outdated Show resolved Hide resolved
qubespdfconverter/server.py Outdated Show resolved Hide resolved
@neowutran
Copy link
Author

Thanks you very much for the review ! :)
I will take some time to fix that, probably next weekend.

* As it stands, I'd rather not support password-protected PDFs. The main issues with this implementation are that passwords are specified as command-line arguments, you may only specify 1 password even though multiple files may have different passwords, and `_decrypt()` is ironically incredibly cryptic relative to the rest of the codebase.

About the "_decrypt()", i sadly agree. I don't know how to improve the code of this part since it is mostly libreoffice API.
I think instead I will add comments to be more verbose about what it does exactly

* It seems to me that `--gui` is just for password prompts? If so, I think it's a bit unnecessary. If password-protected PDFs _were_ supported, GUI prompts are basically the only option considering that multiple files can be processed.

Yes, "--gui" is just for password prompts.

The original intend was to have 2 way of using this tools, the "standard" way, and as a batch process (So without any gui or blocking prompt). However, after reading your comments, it seems that it is more complex than the interest it have.
I think I will drop "--gui", "--password" and my "batch process" idea, could be implemented later if someone explicitly state the need.

However I will keep the password-protected file support: I see more and more phishing campaign using password-protected office file containing malicious macro / commands, so I [and probably others] have a need for that feature.

Plus, GUI prompts shouldn't be created from the server (for security reasons), but from requests made to Dom0 which always provides a GUI.

Can you explain more about that ?
Don't really see the issue with the [potentially infected] server showing up password prompt ?
[Except the: "malicious document exploiting a vulnerability in one of the tools used by server, to display a password prompt for the user. User that will mindlessly fill the field with the same password he use for every non-malicious office document. Giving the attacker a way to decrypt the others documents he stole before", but i don't see what I can do for this case (except a documentation on how to configure the disposableVM to disable network access and every non PdfConvert qubes RPC) ]

* Having to use raw sockets for multiple file type support... Is this really the only the way? sweat_smile

Yeah, not particularly elegant...
The problem I am trying to solve is that when I start the libreoffice process, I need to way a bit until it is ready to take requests.
I will check the documentation to see if there is something like "notify-me when ready" event, available on the command line.

@marmarek
Copy link
Member

* Plus, GUI prompts shouldn't be created from the server (for security reasons)

As @neowutran already said - having password prompt on the server side should be fine. This way, all the things that interacts with potentially malicious PDF file are isolated in that VM. Plus, it will allow to avoid protocol change (server will prompt for the password itself, no need to receive it from the client). That will be at the cost of having password prompt always interactive (no way to script it), but in my opinion it isn't big issue.

There is something wrong with commits here, a lot of conflicts. Most likely because of commits for the old version still present here. I think the easier way to fix it is to forcefully rebase the whole thing:

# we'll discard all the changes from the current branch, make a backup
# also make sure to commit everything that is useful first, otherwise it will be discarded too
git branch backup-pre-rebase HEAD
# assuming "origin" points at https://github.com/QubesOS/qubes-app-linux-pdf-converter
git fetch --tags origin
# verify signature
git verify-tag $(git describe origin/master)
# this discards every local modification and commit
git reset --hard origin/master
# then pickup just changed files from the backup (example list, I haven't checked what files you've modified)
git checkout backup-pre-rebase qubespdfconverter/client.py qubespdfconverter/client.py
# review changes, then commit and push
git diff --cached
git commit -a
git push --force

Generally I would recommend creating separate branch for pull requests (and various changes in general), instead of using master in your repo, but that's unrelated topic.

PS I haven't looked at the changes yet, only at comments here.

@ibokuri
Copy link
Contributor

ibokuri commented Jul 12, 2020

However I will keep the password-protected file support [...]

Oh, I want password-protected file support too, I just meant not with this specific implementation.

Don't really see the issue with the [potentially infected] server showing up password prompt ?

I guess I just thought untrusted VMs shouldn't draw prompts or something. Marek said it was fine though so forget about that part.

@ibokuri
Copy link
Contributor

ibokuri commented Jul 12, 2020

By the way, would you mind running some performance tests? We've been using this as a benchmark, and you can find my results here.

neowutran added a commit to neowutran/qubes-app-linux-pdf-converter that referenced this pull request Jul 14, 2020
@neowutran
Copy link
Author

neowutran commented Jul 14, 2020

@Bl0nd

time qvm-convert-pdf 253665-sdm-vol-1.pdf

OG server, new client (batch: 50, bulk): 5 min

: On my computer: 3 min 33

new server (as d2880ff), new client (as d2880ff): 2min10

neowutran added a commit to neowutran/qubes-app-linux-pdf-converter that referenced this pull request Jul 16, 2020
@neowutran
Copy link
Author

neowutran commented Jul 17, 2020

Things that doesn't work yet:

  • CI build [issue between pylint & UNO libreoffice API]

@neowutran
Copy link
Author

neowutran commented Jul 21, 2020

I think it is ready for another review.

Things that changed since last time:

  • Things we discussed previously should be fixed
  • Modification to fix to GUI progress bar (zenity)

(tested on archlinux, fc32 and buster)

CI build still doesn't work: Python still doesn't find the libreoffice UNO lib, no idea why.

@neowutran
Copy link
Author

@Bl0nd if you want to review it once more :)

@ibokuri
Copy link
Contributor

ibokuri commented Jul 29, 2020

Been a bit busy with some other stuff lately, sorry about that. I'll try to review it as soon as I can.

qubespdfconverter/server.py Outdated Show resolved Hide resolved
qubespdfconverter/server.py Outdated Show resolved Hide resolved
qubespdfconverter/server.py Outdated Show resolved Hide resolved
qubespdfconverter/server.py Show resolved Hide resolved
qubespdfconverter/server.py Outdated Show resolved Hide resolved
qubespdfconverter/server.py Outdated Show resolved Hide resolved
qubespdfconverter/server.py Outdated Show resolved Hide resolved
qubespdfconverter/server.py Outdated Show resolved Hide resolved
qubespdfconverter/server.py Outdated Show resolved Hide resolved
qubespdfconverter/server.py Outdated Show resolved Hide resolved
neowutran added a commit to neowutran/qubes-app-linux-pdf-converter that referenced this pull request Aug 1, 2020
@neowutran
Copy link
Author

Thanks for the review @Bl0nd !
(Normally) Applied nearly all your recommendations (I will comment on the one I willingly not applied)
Don't hesitate if you have more to say, and thanks again :)

@ibokuri
Copy link
Contributor

ibokuri commented Aug 1, 2020

Nice work! Btw, could you resolve the conversations you've dealt with? It'd make it easier to see what still needs to be worked on.

@neowutran
Copy link
Author

neowutran commented Aug 1, 2020

Done, 2 unresolved conversations.
For 1 I will send a new commit later this week end

One a side note, travis CI is still not working, python doesn't find the libreoffice dependency to resolve "uno". For the moment no idea why, but probably better to keep that for the end

From my understanding, it doesn't work because:
Python is run in a "virtualenv" so it can't access system libraries (so can't access the libreoffice python library).
There is a "system_site_packages" option to allow "virtualenv" to access system libraries, but it work only for python version that have been installed on the OS. CI OS is old and doesn't support a recent enough version of python for the "pdf-converter" so we need to use a downloaded version defined in the .travis file, so we can't use "system_site_packages".
As a workaround, some other project like unoconv download libreoffice and use the python binary embedded inside (see: https://github.com/unoconv/unoconv/blob/master/ci/linux.bash ).
But we also need to install some pip dependencies.

Update:
So I decided to ignore the error message instead of starting to do something very complex for nearly nothing ( 0302da8 )

@fepitre
Copy link
Member

fepitre commented Oct 22, 2020

I think first you can try to solve the Makefile conflict (rebase) then @marmarek?

@neowutran
Copy link
Author

Seems like the new version of libreoffice (>= 7.0) on archlinux introduced a bug/regression in the "storeAsURL" API for some file type (.docx). So WIP again to find a workaround / open bug ticket

@neowutran
Copy link
Author

I was unable to find an explanation of why the "decrypt" function stop working with libreoffice >= 7, so filled a bug ticket https://bugs.documentfoundation.org/show_bug.cgi?id=137926

@neowutran
Copy link
Author

neowutran commented Nov 8, 2020

Anyway it is ready for review, could instruct archlinux user to select "libreoffice-still" (instead of "libreoffice-fresh") until libreoffice team fix the issue.
Or need to dig the libreoffice git to find what was changed in XStorable / storeasurl for the handling of com::sun::star::document::MediaDescriptor parameters

@neowutran
Copy link
Author

Some idea of workaround for the libreoffice bug (I didn't found the motivation nor to [ learn c++ and learn libreoffice codebase ] nor to [ recompile and test libreoffice to find the faulty commit between 6.4 and 7.0] ):
Instead of using uno, just use the built in macro system.

  • Programmatically create a macro (aka create a file in ~/.config/libreoffice/4/user/basic/Standard/RANDOM_MODULE_NAME.xba)
    Something like that
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE script:module PUBLIC "-//OpenOffice.org//DTD OfficeDocument 1.0//EN" "module.dtd">
<script:module xmlns:script="http://openoffice.org/2000/script" script:name="Module1" script:language="StarBasic">REM  *****  BASIC  *****

Sub Main
   dim properties(1) as new com.sun.star.beans.PropertyValue
   url = convertToURL(&quot;/home/user/qubes-app-linux-pdf-converter/tests/files_success/doc.doc&quot;)
   properties(0).Name = &quot;Password&quot;
   properties(0).Value = &quot;toor&quot;
   properties(1).Name = &quot;Hidden&quot;
   properties(1).Value = True
   doc = StarDesktop.loadComponentFromUrl(url, &quot;_blank&quot;, 0, properties())
   dim properties2(0) as new com.sun.star.beans.PropertyValue
   doc.storeAsURL(&quot;file:////home/user/qubes-app-linux-pdf-converter/tests/files_success/tata.doc&quot;, properties2())
   StarDesktop.Terminate()
End Sub
</script:module>

Then call libreoffice

soffice --nologo --norestore --nodefault --headless 'macro:///Standard.Module1.Main' 

…able to libreoffice when launching the service, it is possible that this modification won't be required with the libreoffice version provided by the distribution, but for the moment I need to apply a patch for libreoffice
@neowutran
Copy link
Author

Found the commit that broke the behavior that I was relying on.
Sent a suggestion/proposition of patch to re-add the behavior (everything is in the libreoffice bug ticket).
Tested on my side, with the patch everything seems to be working correctly ( all my tests-cases work ).

Also, since I now understand this change of behavior, I wrote a workaround for this issue.

@neowutran neowutran changed the title WIP Add support for more file types + archlinux packaging Add support for more file types + archlinux packaging May 2, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants